Search for: All records

Creators/Authors contains: "Parag, T"


  1. Traditional workload analysis uses discrete time measured in data accesses; the classic independent reference model (IRM) is one example. Effective solutions exist for modeling workloads with stochastic access patterns, but they incur a high cost on Zipfian workloads, which may contain millions of items, each accessed at a different frequency. This paper first presents a continuous-time model of locality for workloads with stochastic access patterns. It shows that two previous techniques, by Dan and Towsley and by Denning and Schwartz, can be interpreted as a single model using different discrete times. Using continuous time, it derives a closed-form solution for a single item and a general solution that is a differentiable function. In addition, the paper presents an approximation technique that groups items into partitions. When evaluated on Zipfian workloads, a workload with millions of items can be approximated using a small number of partitions, and the continuous-time model is both more accurate and faster to compute numerically. For the largest data size verifiable by trace generation and simulation, the new techniques reduce the time of locality analysis by six orders of magnitude. (A sketch of the partitioned continuous-time model appears after this list.)
    Free, publicly-accessible full text available July 2, 2026
  2. Free, publicly-accessible full text available November 17, 2025
  3. Erek Petrank and Steve Blackburn (Eds.)
    Cache replacement policies typically use some form of statistics on past access behavior. As a common limitation, however, the extent of the recorded history is limited to either just the data in cache or, more recently, a larger but still finite-length window of accesses, because the cost of keeping a long history can easily outweigh its benefit. This paper presents a statistical method to track instruction-pointer-based access reuse intervals of arbitrary length and uses this information to identify the Least Expected Use (LEU) blocks for replacement. LEU uses dynamic sampling supported by novel hardware that maintains state to record arbitrarily long reuse intervals. LEU is evaluated using the Cache Replacement Championship simulator, tested on PolyBench and SPEC, and compared with five policies, including a recent technique that approximates optimal caching using a fixed-length history. By maintaining statistics for an arbitrary history, LEU outperforms previous techniques on a broad range of scientific kernels, whose data reuses are longer than those in traces traditionally used in computer architecture studies. (A simplified software model of LEU appears after this list.)
  4. Cache management is important for exploiting locality and reducing data movement. This article studies a new type of programmable cache called the lease cache. By assigning leases, software exerts the primary control over when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache. This article develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA with the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 to 3,000 times the cache size, and the techniques in this article realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization. (A toy lease-cache model appears after this list.)
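To make item 1 concrete, here is a minimal Python sketch of the continuous-time model and the partition approximation. It assumes Poisson (IRM) accesses, under which each item's expected contribution to the working set over a window of length t has the closed form 1 - e^(-rate*t). The function names, the Zipf exponent, and the log-spaced partition boundaries are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def zipf_rates(n_items, alpha=1.0):
    """Zipfian access rates, normalized so the total rate is 1
    (one access per unit of continuous time on average)."""
    freq = 1.0 / np.arange(1, n_items + 1) ** alpha
    return freq / freq.sum()

def working_set_exact(rates, windows):
    """Expected number of distinct items accessed in a window of length t
    under Poisson (IRM) accesses; per-item closed form 1 - exp(-rate*t)."""
    return np.array([np.sum(1.0 - np.exp(-rates * t)) for t in windows])

def working_set_partitioned(rates, windows, n_partitions=64):
    """Group items of similar frequency (log-spaced rank boundaries, an
    illustrative choice) and use each partition's mean rate for all of
    its items: n_partitions exponentials per window instead of n_items."""
    edges = np.unique(np.concatenate(
        ([0], np.geomspace(1, len(rates), n_partitions).astype(int))))
    t = np.asarray(windows, dtype=float)
    ws = np.zeros_like(t)
    for lo, hi in zip(edges[:-1], edges[1:]):
        ws += (hi - lo) * (1.0 - np.exp(-rates[lo:hi].mean() * t))
    return ws

if __name__ == "__main__":
    rates = zipf_rates(1_000_000)      # a million items, each with its own rate
    windows = np.logspace(0, 7, 8)     # window lengths in mean inter-access units
    exact = working_set_exact(rates, windows)
    approx = working_set_partitioned(rates, windows)
    for t, e, a in zip(windows, exact, approx):
        print(f"t={t:10.0f}  exact={e:12.1f}  partitioned={a:12.1f}")
```

The point of the grouping is cost: the partitioned sum evaluates one exponential per partition per window rather than one per item, which is how a million-item workload collapses to a small number of partitions.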
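Item 3's policy can be sketched, with heavy simplification, as a software model. The paper's LEU relies on dynamic sampling and novel hardware to keep arbitrary-length reuse-interval statistics; this hypothetical sketch instead keeps a full per-PC running mean in a dictionary. The class name, the trace, and the exact form of the eviction rule are illustrative assumptions.

```python
from collections import defaultdict

class LEUCache:
    """Toy LEU-style replacement: each cached block remembers the PC that
    last touched it; per-PC averages of observed reuse intervals predict
    each block's next use, and the block whose expected next use is
    farthest away (or never) is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0
        self.blocks = {}                            # block -> (pc, last_access_time)
        self.stats = defaultdict(lambda: [0.0, 0])  # pc -> [interval_sum, count]

    def _expected_next_use(self, block):
        pc, t = self.blocks[block]
        total, count = self.stats[pc]
        # no observed reuses for this PC: assume the block is never reused
        return t + total / count if count else float("inf")

    def access(self, pc, block):
        """Simulate one access; returns True on a hit, False on a miss."""
        self.clock += 1
        hit = block in self.blocks
        if hit:
            # credit the observed reuse interval to the PC of the prior access
            prev_pc, prev_t = self.blocks[block]
            self.stats[prev_pc][0] += self.clock - prev_t
            self.stats[prev_pc][1] += 1
        elif len(self.blocks) >= self.capacity:
            victim = max(self.blocks, key=self._expected_next_use)
            del self.blocks[victim]
        self.blocks[block] = (pc, self.clock)
        return hit

if __name__ == "__main__":
    cache = LEUCache(capacity=2)
    trace = [(0x400a, "A"), (0x400b, "B"), (0x400a, "A"),
             (0x400c, "C"), (0x400a, "A")]
    hits = sum(cache.access(pc, b) for pc, b in trace)
    print(f"hits: {hits}/{len(trace)}")   # hits: 2/5
```

In the example trace, block A's PC accumulates a short observed reuse interval, so A is predicted to be used again soon and survives eviction while the never-reused block B is chosen as the victim.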
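For item 4, a toy model shows what lease programming means operationally: software supplies a lease for each access point, and a block stays cached only while its lease lasts. Ticking leases down once per access, keying the lease table by a named reference point, and all identifiers here are assumptions for illustration; the article's lease cache is a physical design emulated in FPGA.

```python
class LeaseCache:
    """Toy lease cache: a block stays cached exactly as long as its lease,
    renewed on every access from a software-supplied table. There is no
    replacement policy; residency is whatever the leases dictate."""

    def __init__(self, lease_table, default_lease=0):
        self.lease_table = lease_table    # access point -> lease (in accesses)
        self.default_lease = default_lease
        self.remaining = {}               # block -> remaining lease

    def access(self, point, block):
        """Simulate one access; returns True on a hit, False on a miss."""
        hit = block in self.remaining
        # every other resident block ages by one access; expired leases evict
        for b in list(self.remaining):
            if b != block:
                self.remaining[b] -= 1
                if self.remaining[b] <= 0:
                    del self.remaining[b]
        # (re)start this block's lease; a lease of 0 bypasses the cache
        lease = self.lease_table.get(point, self.default_lease)
        if lease > 0:
            self.remaining[block] = lease
        else:
            self.remaining.pop(block, None)
        return hit

if __name__ == "__main__":
    # hypothetical per-reference leases, as if chosen by profiling or a compiler
    cache = LeaseCache({"r1": 3, "r2": 0})
    trace = [("r1", "A"), ("r2", "X"), ("r2", "Y"), ("r1", "A")]
    print([cache.access(p, b) for p, b in trace])
    # [False, False, False, True]: A's lease of 3 spans the streaming accesses
```

A lease of 0 lets streaming data (X and Y above) pass through without displacing anything, while the longer lease holds A across its reuse; this is the sense in which software, rather than a hardware replacement policy, controls data movement.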